Utilising Tree-Based Ensemble Learning for Speaker Segmentation
In audio and speech processing, accurate detection of the change points between multiple speakers in a speech segment is an important stage for several applications such as speaker identification and tracking. Bayesian Information Criterion (BIC)-based approaches are the most widely used, as they have proved very effective for this task. The main criticism levelled against BIC-based approaches is the use of a penalty parameter in the BIC function, which must be fine-tuned for each variation of the acoustic conditions. Once tuned for a certain condition, the model becomes biased towards the data used for tuning, limiting its ability to generalise. In this paper, we propose a BIC-based, tuning-free approach to speaker segmentation through ensemble learning. A forest of segmentation trees is constructed, in which each tree is trained using a sampled version of the speech segment. During tree construction, a set of randomly selected points in the input sequence is examined as potential segmentation points; the point that yields the highest ΔBIC is chosen, and the same process is repeated for the resulting left and right segments, so that each node of the tree corresponds to the highest ΔBIC and its associated point index. After building the forest, the accumulated ΔBIC across all trees is calculated for each point, and the positions of the local maxima are taken as speaker change points. The proposed approach is tested on artificially created conversations from the TIMIT database. It shows very accurate results, comparable to those achieved by state-of-the-art methods, with a 9% (absolute) higher F1 score than the standard ΔBIC approach with an optimally tuned penalty parameter.
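A minimal Python sketch of the forest-of-segmentation-trees idea as described in the abstract, not the authors' implementation: it assumes full-covariance Gaussian models over feature frames (e.g. MFCCs), drops the BIC penalty term (the ensemble vote replaces the tuned parameter), and the names grow_tree, min_len and n_candidates are illustrative assumptions. Here the per-tree randomness comes from candidate sampling alone, whereas the paper additionally resamples the segment per tree.

```python
import numpy as np

def delta_bic(X, i):
    """Gaussian dBIC at split point i: one model for the whole segment vs.
    two models for the sub-segments. Positive values favour a change at i.
    The penalty term is omitted; the ensemble vote stands in for it."""
    def logdet_cov(Z):
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        return np.linalg.slogdet(cov)[1]
    n = len(X)
    return (n * logdet_cov(X)
            - i * logdet_cov(X[:i])
            - (n - i) * logdet_cov(X[i:]))

def grow_tree(X, lo, hi, scores, rng, n_candidates=20, min_len=50):
    """Pick the random candidate point with the highest dBIC, accumulate its
    score at that index, then recurse on the left and right sub-segments."""
    if hi - lo <= 2 * min_len:
        return
    cands = rng.integers(lo + min_len, hi - min_len, size=n_candidates)
    vals = [delta_bic(X[lo:hi], c - lo) for c in cands]
    best = int(np.argmax(vals))
    if vals[best] <= 0:          # no evidence of a speaker change here
        return
    split = int(cands[best])
    scores[split] += vals[best]
    grow_tree(X, lo, split, scores, rng, n_candidates, min_len)
    grow_tree(X, split, hi, scores, rng, n_candidates, min_len)

def segment(X, n_trees=100):
    """Accumulate dBIC over the forest; local maxima of the accumulated
    scores are returned as hypothesised speaker change points."""
    scores = np.zeros(len(X))
    rng = np.random.default_rng(0)
    for _ in range(n_trees):
        grow_tree(X, 0, len(X), scores, rng)
    return [i for i in range(1, len(X) - 1)
            if scores[i] > 0
            and scores[i] >= scores[i - 1]
            and scores[i] >= scores[i + 1]]
```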
A Contextual Study of Semantic Speech Editing in Radio Production
Radio production involves editing speech-based audio using tools that represent sound using simple waveforms. Semantic speech editing systems allow users to edit audio using an automatically generated transcript, which has the potential to improve the production workflow. To investigate this, we developed a semantic audio editor based on a pilot study. Through a contextual qualitative study of five professional radio producers at the BBC, we examined the existing radio production process and evaluated our semantic editor by using it to create programmes that were later broadcast.

We observed that the participants in our study wrote detailed notes about their recordings and used annotation to mark which parts they wanted to use. They collaborated closely with the presenter of their programme to structure the contents and write narrative elements. Participants reported that they often work away from the office to avoid distractions, and print transcripts so they can work away from screens. They also emphasised that listening is an important part of production, to ensure high sound quality. We found that semantic speech editing with automated speech recognition can be used to improve the radio production workflow, but that annotation, collaboration, portability and listening were not well supported by current semantic speech editing systems. In this paper, we make recommendations on how future semantic speech editing systems can better support the requirements of radio production.
Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis
Individuals with Autism Spectrum Disorder (ASD) tend to show distinctive, atypical acoustic patterns of speech. These behaviours affect social interactions and social development and could represent a non-invasive marker for ASD. We systematically reviewed the literature quantifying acoustic patterns in ASD. Search terms were: (prosody OR intonation OR inflection OR intensity OR pitch OR fundamental frequency OR speech rate OR voice quality OR acoustic) AND (autis* OR Asperger). Results were filtered to include only empirical studies quantifying acoustic features of vocal production in ASD, with a sample size > 2, and the inclusion of a neurotypical comparison group and/or correlations between acoustic measures and severity of clinical features. We identified 34 articles, including 30 univariate studies and 15 multivariate machine-learning studies. We performed meta-analyses of the univariate studies, identifying significant differences in mean pitch and pitch range between individuals with ASD and comparison participants (Cohen's d of 0.4-0.5 and discriminatory accuracy of about 61-64%). The multivariate studies reported higher accuracies than the univariate studies (63-96%). However, the methods used and the acoustic features investigated were too diverse to permit meta-analysis. We conclude that multivariate studies of acoustic patterns are a promising but as yet unsystematic avenue for establishing ASD markers. We outline three recommendations for future studies: open data, open methods, and theory-driven research.
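The effect sizes reported above are standardised mean differences (Cohen's d, pooled-standard-deviation form). For reference, a minimal sketch of that textbook computation, not code from the review; the group arrays (e.g. per-participant mean pitch values) are hypothetical inputs:

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Standardised mean difference between two independent groups,
    using the pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

# e.g. cohens_d(mean_f0_asd, mean_f0_comparison) -> ~0.4-0.5 would match
# the pitch effect reported in the meta-analysis (variable names hypothetical).
```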
Audio source separation into the wild
This review chapter is dedicated to multichannel audio source separation in real-life environments. We explore some of the major achievements in the field and discuss some of the remaining challenges. We examine several important practical scenarios, e.g. moving sources and/or microphones, a varying number of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization problems. Several applications, such as smart assistants, cellular phones, hearing aids and robots, are discussed. Our perspectives on the future of the field are given as concluding remarks to this chapter.
The ICSI RT-09 Speaker Diarization System
Tools for multimodal annotation
Researchers interested in the sounds of speech or the physical gestures of speakers make use of audio and video recordings in their work. Annotating these recordings presents a different set of requirements from the annotation of text. Special-purpose tools have been developed to display video and audio signals and to allow the creation of time-aligned annotations. This chapter reviews the most widely used of these tools for both manual and automatic generation of annotations on multimodal data.